Gradient Descent: Second Order Momentum and Saturating Error

Author

  • Barak A. Pearlmutter
Abstract

Batch gradient descent, $\Delta w(t) = -\eta\, dE/dw(t)$, converges to a minimum of quadratic form with a time constant no better than $\frac{1}{4}\lambda_{\max}/\lambda_{\min}$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of the Hessian matrix of $E$ with respect to $w$. It was recently shown that adding a momentum term, $\Delta w(t) = -\eta\, dE/dw(t) + \alpha\, \Delta w(t-1)$, improves this to approximately $\frac{1}{4}\sqrt{\lambda_{\max}/\lambda_{\min}}$, although only in the batch case. Here we show that second-order momentum, $\Delta w(t) = -\eta\, dE/dw(t) + \alpha\, \Delta w(t-1) + \beta\, \Delta w(t-2)$, can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a non-quadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics.
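
The three update rules in the abstract are easy to compare numerically. Below is a minimal sketch, not taken from the paper, that applies plain gradient descent, first-order momentum, and second-order momentum to a two-dimensional quadratic error $E(w) = \frac{1}{2} w^{\top} H w$; the Hessian, learning rate $\eta$, and momentum coefficients $\alpha$ and $\beta$ are illustrative choices.

```python
import numpy as np

H = np.diag([1.0, 10.0])     # Hessian: lambda_min = 1, lambda_max = 10
w0 = np.array([1.0, 1.0])    # starting weights

def descend(eta, alpha=0.0, beta=0.0, steps=200):
    """Iterate Delta w(t) = -eta dE/dw(t) + alpha Delta w(t-1) + beta Delta w(t-2)."""
    w = w0.copy()
    dw_prev = np.zeros_like(w)     # Delta w(t-1)
    dw_prev2 = np.zeros_like(w)    # Delta w(t-2)
    for _ in range(steps):
        grad = H @ w               # dE/dw for E(w) = 0.5 * w.T @ H @ w
        dw = -eta * grad + alpha * dw_prev + beta * dw_prev2
        w = w + dw
        dw_prev2, dw_prev = dw_prev, dw
    return 0.5 * w @ H @ w         # remaining error

print("plain gradient descent:", descend(eta=0.18))
print("first-order momentum:  ", descend(eta=0.18, alpha=0.5))
print("second-order momentum: ", descend(eta=0.18, alpha=0.45, beta=0.05))
```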


Similar Resources

Handwritten Character Recognition using Modified Gradient Descent Technique of Neural Networks and Representation of Conjugate Descent for Training Patterns

The purpose of this study is to analyze the performance of the backpropagation algorithm with changing training patterns and a second momentum term in feedforward neural networks. The analysis is conducted on 250 different three-letter lowercase words from the English alphabet. These words are presented to two vertical segmentation programs which are designed in MATLAB and based on portions (1...

Full text

To appear in the proceedings of NIPS*4, Morgan Kaufmann 1992. Gradient Descent: Second-Order Momentum and Saturating Error

Batch gradient descent, $\Delta w(t) = -\eta\, dE/dw(t)$, converges to a minimum of quadratic form with a time constant no better than $\frac{1}{4}\lambda_{\max}/\lambda_{\min}$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of the Hessian matrix of $E$ with respect to $w$. It was recently shown that adding a momentum term $\Delta w(t) = -\eta\, dE/dw(t) + \alpha\, \Delta w(t-1)$ improves this to $\frac{1}{4}\sqrt{\lambda_{\max}/\lambda_{\min}}$, although only in the batch case. Here we show that ...

Full text

An Investigation of the Gradient Descent Process in Neural Networks

Usually gradient descent is merely a way to find a minimum, abandoned if a more efficient technique is available. Here we investigate the detailed properties of the gradient descent process, and the related topics of how gradients can be computed, what the limitations on gradient descent are, and how the second-order information that governs the dynamics of gradient descent can be probed. To de...

Full text
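
One common way to probe the second-order information mentioned in the abstract above, without forming the Hessian explicitly, is to estimate Hessian-vector products from finite differences of gradients. The sketch below is only an illustration of that idea; the test error, its Hessian, and the step size are assumptions for the example, not code from the cited paper.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])    # Hessian of the assumed test error E(w) = 0.5 * w.T @ A @ w

def grad(w):
    """Gradient of the assumed test error."""
    return A @ w

def hessian_vector_product(grad_fn, w, v, r=1e-5):
    """Estimate (d^2 E / dw^2) v by a finite difference of gradients."""
    return (grad_fn(w + r * v) - grad_fn(w)) / r

w = np.array([0.5, -1.0])
v = np.array([1.0, 0.0])
print(hessian_vector_product(grad, w, v))   # approximately the first column of A: [3. 1.]
```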

Momentum and Optimal Stochastic Search

The rate of convergence for gradient descent algorithms, both batch and stochastic, can be improved by including in the weight update a “momentum” term proportional to the previous weight update. Several authors [1, 2] give conditions for convergence of the mean and covariance of the weight vector for momentum LMS with constant learning rate. However, stochastic algorithms require that the learn...

Full text
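
As a rough illustration of the setting discussed above, the following sketch runs LMS-style stochastic gradient descent with a momentum term and a decaying learning rate. The data model, momentum coefficient, and decay schedule are assumptions chosen for the example, not taken from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.0])    # weights the noisy data is generated from
w = np.zeros(2)                   # weight estimate being learned
dw_prev = np.zeros(2)             # previous weight update (momentum state)
alpha = 0.5                       # momentum coefficient

for t in range(1, 5001):
    x = rng.normal(size=2)                  # random input sample
    y = x @ w_true + 0.1 * rng.normal()     # noisy target
    err = y - x @ w
    grad = -err * x                         # stochastic gradient of 0.5 * err**2
    eta = 0.1 / np.sqrt(t)                  # decaying learning rate
    dw = -eta * grad + alpha * dw_prev      # momentum LMS update
    w = w + dw
    dw_prev = dw

print(w)    # should end up close to w_true
```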

On the importance of initialization and momentum in deep learning

Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train...

Full text
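
The following sketch shows one possible form of stochastic gradient descent with a small random initialization and a slowly increasing momentum schedule of the kind the abstract describes, applied to a toy objective. The linear ramp and all constants are illustrative assumptions, not the schedule or initialization from the cited paper.

```python
import numpy as np

def momentum_schedule(t, mu_start=0.5, mu_max=0.99, ramp=500):
    """Slowly ramp the momentum coefficient from mu_start up to mu_max."""
    return min(mu_max, mu_start + (mu_max - mu_start) * t / ramp)

rng = np.random.default_rng(1)
target = np.array([1.0, -2.0, 0.5, 3.0])    # minimizer of the toy objective
w = 0.01 * rng.normal(size=4)               # small random initialization
v = np.zeros(4)                             # velocity (previous update)
eta = 0.01                                  # learning rate

for t in range(2000):
    grad = (w - target) + 0.05 * rng.normal(size=4)   # noisy gradient of 0.5*||w - target||^2
    mu = momentum_schedule(t)
    v = mu * v - eta * grad                 # classical momentum update
    w = w + v

print(w)    # should end up near the target
```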


Journal:

Volume   Issue

Pages   -

Publication date: 1991